Our enriched master tables provide fertile ground for exploratory data analysis as we seek to understand the playlist groupings and song features for all songs in the provided dataset.
We begin by examining how many songs appear in each playlist and how many playlists each song appears in. The distribution of playlists per song is strongly right-skewed: a few songs appear in an enormous number of playlists, while the vast majority appear in very few.
The median number of songs in a playlist is about 20, though this too has a long right tail. The maximum number of songs in a single playlist in our dataset is 341.
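The two counts described above can be sketched with pandas. This is a minimal illustration that assumes a long-format table with one row per playlist-track pairing; the `playlist_id` and `track_id` column names and the toy rows are hypothetical, not the project's actual schema or data.

```python
import pandas as pd

# Hypothetical long-format table: one row per (playlist, track) pairing.
# Column names and values are assumptions made for illustration.
playlist_tracks = pd.DataFrame({
    "playlist_id": ["p1", "p1", "p1", "p2", "p2", "p3"],
    "track_id":    ["a",  "b",  "c",  "a",  "b",  "a"],
})

# Songs per playlist: number of distinct tracks in each playlist.
songs_per_playlist = playlist_tracks.groupby("playlist_id")["track_id"].nunique()

# Playlists per song: number of distinct playlists each track appears in.
playlists_per_song = playlist_tracks.groupby("track_id")["playlist_id"].nunique()

summary = {
    "median_songs_per_playlist": songs_per_playlist.median(),
    "max_songs_per_playlist": songs_per_playlist.max(),
    # Positive skewness indicates a long right tail.
    "playlists_per_song_skew": playlists_per_song.skew(),
}
print(summary)
```

On real data, a large positive skewness for `playlists_per_song` would correspond to the long right tail noted above.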
For our newly enriched musical features, we use Seaborn distribution plots to understand the nature of each of these columns. Some have a relatively even distribution, but more are clustered around a narrow range of values. Song duration centers, as expected, around 3-4 minutes, though a few songs are extremely long. We presume these are songs in playlists related to sleep, where frequent switching is detrimental to a relaxing environment. Danceability, a critical feature in these authors' opinion, centers around a score of 0.6, while energy has a left skew with a center around 0.8-0.9. Liveness, instrumentalness, speechiness and loudness all have fairly centered values, with a few songs exhibiting different behavior. The overall clustering of song features around a few values is an interesting takeaway.
We next explore whether any features are correlated with one another. Most pairs appear to be unrelated, with the points filling the entire square of their pairplot panel. There are some potential relationships, which we explore in detail next.
Energy and loudness have the most visually linear relationship: as energy increases, so too does loudness. We have affectionately labeled this the "Halley's Comet of Music".
Tempo and danceability, the two features with the most nearly normal distributions, appear to mirror each other as well, with peaks at a tempo of 125 and a danceability range between 0.6 and 0.8.
Charting that same relationship between tempo and danceability on a scatterplot highlights that peak concentration even better. The "Tempo Bump" occurs at about 125 beats per minute (BPM), with a clear jump in danceability.
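The visual impressions above can also be checked numerically by ranking feature pairs by the strength of their Pearson correlation. The sketch below uses synthetic stand-in data in which loudness is deliberately constructed to track energy; it is not the project's dataset, and the coefficients in the formula are arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in data, NOT the project's dataset: loudness is built to
# track energy, while valence is independent noise.
energy = rng.uniform(0, 1, n)
features = pd.DataFrame({
    "energy": energy,
    "loudness": -20 + 18 * energy + rng.normal(0, 1.5, n),
    "valence": rng.uniform(0, 1, n),
})

corr = features.corr(method="pearson")

# Keep the strict upper triangle so each pair is listed once, then rank by |r|.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(key=np.abs, ascending=False)
print(pairs)
```

Pearson correlation only captures linear association, so a localized pattern such as the tempo bump would still need the scatterplot to be seen.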
We next turned to whether song features had any relationship to playlist inclusion. Are certain features predictive of the number of playlists a song is likely to be included in?
We started with danceability and see the hint of a linear relationship: as danceability increases, so too does the number of playlists a song is included in. There remains a lot of noise, though, from the songs we saw earlier in our inclusion distributions that are simply not popular and were added to only a few playlists.
Loudness has a clear spike at about -8 dB, with fairly hard limits on either side. Songs quieter than -20 dB are included in very few playlists, and the same holds true for songs louder than roughly 0 dB.
Energy appears to have very little impact on playlist inclusion, with songs at every energy level appearing in many playlists. Playlists vary widely in theme, from slower classical music to upbeat electronic, so this matches our expectations.
As expected, there is a positive, roughly linear relationship between artist_popularity as presented by Spotify and the number of playlists an artist's songs are included in.
That expected linear relationship does not hold for album_popularity, though. Instead, playlist inclusion is clearly concentrated among both very popular albums and very unpopular albums. We posit that this is due to the phenomenon of one-hit wonders, whereby an album might be panned by critics and music fans overall but was still listened to and reviewed because it had one or two very popular songs. Those individual songs will be included in playlists, thanks to the flexibility of streaming services, even though the album overall remains unpopular. It is the albums in the middle range, which are merely OK and have no hit songs, that are most at risk of not being included in playlists.
Song duration (duration_ms), similar to loudness, has a very narrow band of lengths that users consider acceptable for playlist inclusion. This length is about 3.5 minutes, with songs longer than ~4 minutes appearing in an extremely low number of playlists.
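One way to quantify a "narrow band" like this is to bin the feature and summarize playlist counts per bin. The following is a minimal sketch; the `playlist_count` column, the one-minute bin width, and all values are hypothetical choices for illustration.

```python
import pandas as pd

# Hypothetical per-song table; column names and values are illustrative only.
songs = pd.DataFrame({
    "duration_ms":    [150_000, 200_000, 210_000, 215_000, 260_000, 420_000],
    "playlist_count": [3,       40,      55,      35,      6,       1],
})

songs["duration_min"] = songs["duration_ms"] / 60_000

# One-minute-wide bins from 0 to 8 minutes; right=False makes each bin [lo, hi).
bins = list(range(0, 9))
songs["duration_bin"] = pd.cut(songs["duration_min"], bins=bins, right=False)

# Median and maximum playlist count, plus number of songs, per duration bin.
by_bin = (
    songs.groupby("duration_bin", observed=True)["playlist_count"]
    .agg(["median", "max", "count"])
)
print(by_bin)
```

The same binning approach could be applied to loudness or any other continuous feature by swapping the column and bin edges.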
The last musical feature we explored was tempo, which we broke out into a more detailed histogram. Our hypothesis was that there would be several popular tempos for different genres of songs, something we saw hinted at in our analysis of energy. This proved true, with peaks occurring at 80, 100, 120, 128, 139 and 170, which we believe map to different genres of music.
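Peaks in a tempo histogram can be located programmatically as bins that exceed both of their neighbors. The sketch below runs on synthetic tempos drawn around three made-up centers, not on the project's data; the 4-BPM bin width and the minimum-count threshold are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic tempos drawn around three made-up centres; NOT the project's data.
centres = [90, 126, 170]
tempos = np.concatenate([rng.normal(c, 2.0, 2000) for c in centres])

# 4-BPM-wide bins spanning 60-200 BPM.
edges = np.arange(60, 204, 4)
counts, _ = np.histogram(tempos, bins=edges)
mids = (edges[:-1] + edges[1:]) / 2

# An interior bin is a peak if it beats both neighbours and clears a
# minimum count (the threshold of 200 is an arbitrary choice).
inner = counts[1:-1]
is_peak = (inner > counts[:-2]) & (inner > counts[2:]) & (inner >= 200)
peak_tempos = mids[1:-1][is_peak]
print(peak_tempos)
```

The result is sensitive to bin width: bins that are too wide merge nearby peaks (such as 120 and 128), while bins that are too narrow produce spurious ones.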
Moving on from exploring feature relationships, we next explored how these features have been changing over time for songs released in different years. We derived a new release_year column from the album_release_date column and show a preview below. We filter to all songs released between 1950 and 2017 to ensure sufficient data for our chart and show the distribution of release years currently in Spotify's library below.
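A minimal sketch of this derivation follows. It assumes album_release_date is stored as a string; because Spotify release dates can have year, month, or day precision, the year is taken from the first four characters. The rows are illustrative only.

```python
import pandas as pd

# Illustrative rows only. Release dates may carry year, month, or day
# precision, so the year is taken from the first four characters.
tracks = pd.DataFrame({
    "track_name": ["t1", "t2", "t3", "t4"],
    "album_release_date": ["1949", "1977-05", "2015-11-20", "2018-01-05"],
})

tracks["release_year"] = tracks["album_release_date"].str[:4].astype(int)

# Keep songs released between 1950 and 2017, inclusive.
in_range = tracks[tracks["release_year"].between(1950, 2017)]

# Number of songs per release year, in chronological order.
year_counts = in_range["release_year"].value_counts().sort_index()
print(year_counts)
```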
The features are measured on different scales, so we use MinMaxScaler from sklearn to scale them all between 0 and 1 and chart the changes on a single plot. The most dramatic crossover occurs between acousticness and energy, as the former declines severely starting in the 1950s and the latter rises steadily over time. The trend reverses itself temporarily during the 1980s, but the two diverge again at a slower pace from 1990 to today. Valence, Spotify’s measure of positivity in a song, has also been declining at a slow but steady pace since about 1977.
Loudness has been reaching record heights beginning in the 1990s. This may be associated with the spike in popularity of electronic music that decade, driven by the proliferation of cheap music production technology.
The most recent trend is in danceability, which has spiked to its highest-ever levels since 2010.
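The scaling step behind these trend lines can be sketched as follows. This assumes the features are first averaged per release year and then scaled, which is one plausible ordering rather than a confirmed detail of the analysis; the yearly values are made up.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical yearly feature means; the values are made up for illustration.
yearly = pd.DataFrame(
    {
        "acousticness": [0.80, 0.55, 0.30, 0.25],
        "energy": [0.35, 0.50, 0.65, 0.70],
        "loudness": [-14.0, -11.5, -8.0, -6.5],
    },
    index=pd.Index([1950, 1970, 1990, 2010], name="release_year"),
)

# MinMaxScaler rescales each column independently; default range is (0, 1).
scaler = MinMaxScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(yearly), columns=yearly.columns, index=yearly.index
)
print(scaled.round(2))
```

Because each column is scaled against its own minimum and maximum, the resulting plot shows the shape of each trend but not the absolute size of the change.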
Lastly, we wanted to explore popular musical keys for songs as well as a sample of popular genres. Among the 100 most-included songs, there are clear preferences for key 1 (C♯/D♭) and key 7 (G, or sol). For playlists, sleep is a surprisingly popular genre, likely related to the number of extremely long songs. Unfortunately, beyond that, Spotify's structuring of genre makes analysis difficult. Spotify lists many specific genres for each album without specifying whether one is primary. This means that there may be broader genres more popular than sleep, but in a simple histogram analysis their counts are split across many closely related genre labels.
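The fragmentation problem can be seen in a small sketch that counts genre tags with pandas. The albums and genre lists below are hypothetical; the point is only that a single-label genre such as sleep can outrank a broader family whose counts are split across several labels.

```python
import pandas as pd

# Hypothetical albums, each tagged with a list of genres (no primary genre).
albums = pd.DataFrame({
    "album": ["A", "B", "C"],
    "genres": [["sleep"], ["pop", "dance pop", "electropop"], ["pop rock", "sleep"]],
})

# explode() yields one row per (album, genre) tag, so every tag counts once.
genre_counts = albums.explode("genres")["genres"].value_counts()
print(genre_counts)
```

Here four pop-related tags appear across two albums, yet each label is counted separately, so sleep tops the histogram.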